Skip to content

fix(deploy): prevent concurrent project mutations from half-deleting deployments#144

Open
uittenbroekrobbert wants to merge 1 commit into
mainfrom
fix/deployment-delete-race-prevention
Open

fix(deploy): prevent concurrent project mutations from half-deleting deployments#144
uittenbroekrobbert wants to merge 1 commit into
mainfrom
fix/deployment-delete-race-prevention

Conversation

@uittenbroekrobbert

Copy link
Copy Markdown
Contributor

Why

In production, a delete of deployment pr-36 in project toets-hn7 failed terminally: two sibling deployment deletes of the same project ran concurrently, collided on the shared project file during git push, hit an unrecoverable rebase conflict, and left the deployment half-deleted. Only a blind task retry recovered it. The same bug class produced a durable orphan that nothing detected: toets-hn7/pr-32 was removed from the project file on 2026-06-12 but its ArgoCD Application and pods kept running for 12 days.

What

1a. Serialize project-file-mutating tasks per project. The in-flight claim guard keyed on project_name and deployment_name, so sibling deployments of one project ran concurrently and raced on projects/<project>.yaml and the per-project ArgoCD kustomization.yaml. The guard now serializes any two project-file-mutating tasks of the same project (PROJECT_FILE_MUTATING_TASK_TYPES). Clone/backup/restore stay per-deployment so a slow restore never blocks deploys. Cross-project work still runs in parallel.

1b. Self-heal the project-file push on a true rebase conflict. GitConnector.push_changes/commit_and_push accept an optional reapply callback. On an unmergeable rebase it hard-resets to the current remote, re-applies the intended change on fresh content, and retries, instead of failing terminally with a bare RuntimeError. A new typed GitPushConflictError is raised when no reapply is given. The deployment-delete path passes a reapply that re-reads the project file fresh and re-removes the deployment. This also covers the auto-tuner, which commits to project files outside the task system.

2a. Read-only deployment drift report. GET /api/v2/admin/deployments/drift compares deployments declared in project files against live ArgoCD Applications and reports orphaned_deployments (live but undeclared, the pr-32 case) and missing_deployments. Zero mutations. Project-infrastructure apps ({project}-infrastructure) are excluded.

No behaviour change for single, non-concurrent operations.

Testing

  • Real Postgres (test_async_task_claim_serialization_db.py, requires_infra): proves the actual claim SQL on a live DB. The pr-36/pr-37 scenario claims only one; backups are not over-serialized; cross-project deletes both claim.
  • Real git (test_git_push_conflict_integration.py): forces an actual rebase conflict against a local bare remote and proves the reapply path converges while preserving the concurrent writer's change.
  • Mocked unit tests for the claim-guard query wiring, the push self-heal control flow + typed raise, and drift classification.
  • Live production data: ran the drift classifier against 93 real ArgoCD apps + 36 project files; it returns exactly the one real orphan (toets-hn7/pr-32), zero false positives, zero missing.

Follow-up (not in this PR)

  • Remediate the existing toets-hn7/pr-32 orphan (runbook in features/deployment-delete-race-prevention.md) and clear any stale marked_for_deletion row for toets-hn7/pr-36.
  • Optionally wire the reapply hook into the upsert/update-image/auto-tune writers for full coverage.

…deployments

Two sibling deployment deletes of the same project ran concurrently, collided on
the shared project file during git push, hit an unrecoverable rebase conflict,
and left the deployment half-deleted (toets-hn7/pr-36). The same bug class
produced a durable orphan that nothing detected (toets-hn7/pr-32).

- Serialize project-file-mutating tasks per project in the claim guard, so
  deploys/updates/deletes of one project no longer race on projects/<p>.yaml and
  the per-project ArgoCD kustomization. Backups/restores stay per-deployment.
- Self-heal the project-file push: on an unmergeable rebase, reset to the current
  remote, re-apply the intended change, and retry instead of failing terminally
  (new typed GitPushConflictError + reapply hook, wired into the delete path).
- Add a read-only GET /api/v2/admin/deployments/drift report that flags
  deployments live in ArgoCD but absent from the project file.

Verified with a real Postgres (claim serialization), real git (push self-heal
convergence), and against live production data (drift report flags pr-32 only).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant